Abstract

While alignment of texts on the sentential level 
is often seen as being too coarse, and word align- 
ment as being too fine-grained, bi- or multi- 
lingual texts which are aligned on a level in- 
between are a useful resource for many pur- 
poses. Starting from a number of examples of 
non-literal translations, which tend to make
alignment difficult, we describe an alignment 
model which copes with these cases by explicitly 
coding them. The model is based on predicate- 
argument structures and thus covers the middle 
ground between sentence and word alignment. 
The model is currently used in a recently initi- 
ated project of a parallel English-German tree- 
bank (FuSe), which can in principle be extended 
with additional languages. 

1 Introduction 

When building parallel linguistic resources, one 
of the most obvious problems that need to be
solved is that of alignment. Usually, in sentence- 
or word-aligned corpora, alignments are un- 
marked relations between corresponding ele- 
ments. They are unmarked because the kind 
of correspondence between two elements is ei- 
ther obvious or beyond classification. E.g., in 
a sentence-aligned corpus, the n : m relations 
that hold between sentences express the fact 
that the propositions contained in n sentences 
in L1 are basically the same as the proposi-
tions in m sentences in L2 (lowest common
denominator). No further information about 
the kind of correspondence could possibly be 
added on this degree of granularity. On the 
other hand, in word-aligned corpora, words 
are usually aligned as being "lexically equiv-
alent" or are not aligned at all. Although
there are many shades of "lexical equivalence",
these are usually not explicitly categorised.
As (Hansen-Schirra and Neumann, 2003) point
out, for many research questions neither type of
alignment is sufficient, since the most interest-
ing phenomena can be found on a level between
these two extremes.

We propose a more finely grained model 
of alignment which is based on monolingual 
predicate-argument structures, since we assume
that, while translations can be non-literal in a 
variety of ways, they must be based on simi- 
lar predicates and arguments for some kind of 
translational equivalence to be achieved. Fur- 
thermore, our model explicitly encodes the ways 
in which the two versions of a text deviate from
each other. (Salkie, 2002) points out that the
possibility to investigate what types of non- 
literal translations occur on a regular basis is 
one of the major profits that linguists and trans- 
lation theorists can draw from parallel corpora. 

In Section 2 we begin by describing some
ways in which translations can deviate from
one another. We then describe in detail the
alignment model, which is based on a monolin-
gual predicate-argument structure (Section 3).
In Section 4 we conclude by introducing the
parallel treebank project FuSe which uses the
model described in this paper to align German
and English texts from the Europarl parallel
corpus (Koehn, 2002).



2 Differences in Translations 

In most cases, translations are not absolutely 
literal counterparts of their source texts. In or- 
der to avoid translationese, i. e. deviations from 
the norms of the target language, a skilled 
translator will apply certain mechanisms, which
(Salkie, 2002) calls "inventive translations" and
which need to be captured and systematised.
The following section will give some examples 2
of common discrepancies encountered between
a source text and its translation.

* We would like to thank our colleague Frank Schu-
macher for many valuable comments on this paper.
1 Cf. the approach described in (Melamed, 1998).

2 As we work with English and German, all exam-
ples are taken from these two languages. They are taken

2.1 Nominalisations 

Quite frequently, verbal expressions in L1 are
expressed by corresponding nominalisations in
L2. This departure from the source text results
in a completely different structure of the tar-
get sentence, as can be seen in (1) and (2),
where the English verb harmonise is expressed
as Harmonisierung in German. The argument
of the English verb functioning as the grammat-
ical subject is realised as a postnominal modifier
in the German sentence.

(1) The laws against racism must be har- 
monised. 3 

(2) Die Harmonisierung der 
The harmonisation of_the
Rechtsvorschriften gegen den 
laws against the 
Rassismus ist dringend erforderlich. 
racism is urgently necessary. 

This case is particularly interesting, because it 
involves a case of modality. In the English sen- 
tence, the verb is modified by the modal aux- 
iliary must. In order to express the modality 
in the German version, a different strategy is 
applied, namely the use of an adjective with 
modal meaning (erforderlich, 'necessary'). Con- 
sequently, there are two predications in the Ger- 
man sentence as opposed to only one predica- 
tion in the English sentence. 

2.2 Voice 

A further way in which translations can dif- 
fer from their source is the choice of active or 
passive voice. This is exemplified by (3) and
(4). Here, the direct object of the English sen-
tence corresponds to the grammatical subject of
the German sentence, while the subject of the
English sentence is realised as a prepositional
phrase with durch in the German version.

(3) The conclusions of the Theato report 
safeguard them perfectly. 4 



from the Europarl corpus (see Section 4) and are ab-
breviated where necessary. Unfortunately, it is not eas-
ily discernible from the corpus data which language is
the source language. Consequently, our use of the terms
'source', 'target', 'L1', and 'L2' does not admit of any
conclusions as to whether one of the languages is the 
source language, and if so, which one. 

3 Europarl:de-en/ep-00-01-19.al, 489. 

4 Europarl:de-en/ep-00-01-18.al, 749. 



(4) Durch die Schlußfolgerungen des
By the conclusions of_the
Berichts Theato werden sie
report Theato are they
uneingeschränkt bewahrt.
unlimitedly safeguarded

2.3 Negation 

Sometimes, a positive predicate expression is 
translated by negating its antonym. This is the 
case in (5) and (6): both sentences contain a
negative statement, but while the negation is in- 
corporated into the English adjective by means 
of the negative prefix in-, it is achieved syntac- 
tically in the German sentence. 

(5) the Directive is inapplicable in Den- 
mark 5 

(6) die Richtlinie ist in Dänemark nicht
the Directive is in Denmark not
anwendbar
applicable

2.4 Information Structure 

Sentences and their translations can be organ- 
ised differently with regard to their information 
structure. Sentences (7) and (8) are a good ex-
ample of this type of non-literal translation.

(7) Our motion will give you a great deal of 
food for thought, Commissioner 6 

(8) Eine Reihe von Anregungen werden 
A row of suggestions will 
wir Ihnen, Herr Kommissar, mit 
we you, Mr. Commissioner, with 
unserer Entschließung mitgeben
our resolution give 

The German sentence is rather inconspicuous, 
with the grammatical subject being a prototyp- 
ical agent (wir, 'we'). In the English version, 
however, it is the means that is realised in sub- 
ject position and thus perspectivised. The cor- 
responding constituent in German (mit unserer 
Entschließung, 'with our motion') is but an ad-
verbial. In English, the actual agent is not re-
alised as such and can only be identified by a
process of inference based on the presence of the 
possessive pronoun our. Thus, while being more 
or less equivalent in meaning, this sentence pair 
differs significantly in its overall organisation. 



5 Europarl:de-en/ep-00-01-18.al, 2522. 
6 Europarl:de-en/ep-00-01-18.al, 53. 



3 Alignment Model 

The alignment model we propose is based on 
the assumption that a representation of transla- 
tional equivalence can best be approximated by 
aligning the elements of monolingual predicate-
argument structures. Section 3.1 describes this
layer of the model in detail and shows how some
of the differences in translations described in
Section 2 can be accommodated on such a level.
We assume that the annotation model described
here is an extension to linguistic data which are
already annotated with phrase-structure trees,
i.e. treebanks. Section 3.2 shows how the bind-
ing of predicates and arguments to syntactic
nodes is modelled. Section 3.3 describes the de-
tails of the alignment layer and the tags used
to mark particular kinds of alignments, thus ac-
counting for some more of the differences shown
in Section 2.

3.1 Predicates and Arguments 

The predicate-argument structures used in our 
model consist solely of predicates and their ar- 
guments. Although there is usually more than 
one predicate in a sentence, no attempt is made 
to nest structures or to join the predications 
logically in any way. The idea is to make the 
predicate-argument structure as rich as is ne- 
cessary to be able to align a sentence pair while 
keeping it as simple as possible so as not to 
make it too difficult to annotate. In the same 
vein, quantification, negation, and other opera- 
tors are not annotated. In short, the predicate- 
argument structures are not supposed to cap- 
ture the semantics of a sentence exhaustively in 
an interlingua-like fashion.

To have clear-cut criteria for annotators to 
determine what a predicate is, we rely on the 
heuristic assumption that predicates are more 
likely to be expressed by tokens belonging to 
some word classes than by tokens belonging to 
others. Potential predicate expressions in this 
model are verbs, deverbal adjectives and nouns 7 
or other adjectives and nouns which show a syn- 
tactic subcategorisation pattern. The predicates 
are represented by the capitalised citation form 
of the lexical item (e. g. HARMONISE). They are
assigned a class based on their syntactic form 
(v, n, a for 'verbal', 'nominal', and 'adjectival', 
respectively), and derivationally related predi-
cates form a predicate group.

Arguments are given short intuitive role
names (e. g. ENT_HARMONISED, i. e. the entity
being harmonised) in order to facilitate the
annotation process. These role names have to
be used consistently only within a predicate
group. If, for example, an argument of the pred-
icate HARMONISE has been assigned the role
ENT_HARMONISED and the annotator encoun-
ters a comparable role as argument to the pred-
icate HARMONISATION, the same role name for
this argument has to be used. 8
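
The role-name convention can be pictured as a small registry keyed by predicate group. The following Python sketch is purely illustrative: the class RoleRegistry, its methods, and the group identifier "harmonise" are assumptions made for this example and do not describe the actual FuSe tools or database.

```python
from collections import defaultdict


class RoleRegistry:
    """Remembers which role names have been used within each predicate group.

    Hypothetical helper; all names here are invented for illustration.
    """

    def __init__(self):
        self._group_of = {}              # predicate name -> group identifier
        self._roles = defaultdict(list)  # group identifier -> role names

    def add_predicate(self, predicate, group):
        self._group_of[predicate] = group

    def record_role(self, predicate, role):
        roles = self._roles[self._group_of[predicate]]
        if role not in roles:
            roles.append(role)

    def known_roles(self, predicate):
        """Role names already used for any predicate of the same group."""
        return list(self._roles[self._group_of[predicate]])


registry = RoleRegistry()
registry.add_predicate("HARMONISE", group="harmonise")      # class v
registry.add_predicate("HARMONISATION", group="harmonise")  # class n
registry.record_role("HARMONISE", "ENT_HARMONISED")

# An annotator meeting HARMONISATION can be shown the roles of its group.
print(registry.known_roles("HARMONISATION"))
```

Keying the role inventory on the group rather than on the individual predicate is what lets the verbal and the nominal predicate share ENT_HARMONISED.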

The usefulness of such a structure can be
shown by analysing the sentence pair (1) and
(2) in Section 2.1. While the syntactic con-
structions differ considerably, the predicate-
argument structure shows the correspondence
quite clearly (see the annotated sentences in
Figure 1): in the English sentence, we find
the predicate HARMONISE with its argument
ENT_HARMONISED, which corresponds to the
predicate HARMONISIERUNG and its argument
HARMONISIERTES in the German sentence. The
information that a predicate of the class v is
aligned with a predicate of the class n can be
used to query the corpus for this type of non-
literal translation.
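
A query for such class shifts might look as follows. This is a minimal Python sketch over an in-memory list, not the FuSe database interface: the types Predicate and PredicateAlignment, the function class_shifts, and the predicate names SAFEGUARD and BEWAHREN are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    name: str        # capitalised citation form
    pred_class: str  # "v", "n" or "a"
    language: str


@dataclass(frozen=True)
class PredicateAlignment:
    source: Predicate
    target: Predicate


def class_shifts(alignments, class_a, class_b):
    """Alignments that pair a class_a predicate with a class_b predicate,
    regardless of which language carries which class."""
    wanted = {class_a, class_b}
    return [al for al in alignments
            if {al.source.pred_class, al.target.pred_class} == wanted]


alignments = [
    PredicateAlignment(Predicate("HARMONISE", "v", "en"),
                       Predicate("HARMONISIERUNG", "n", "de")),
    # Hypothetical predicate names for the sentence pair (3) and (4).
    PredicateAlignment(Predicate("SAFEGUARD", "v", "en"),
                       Predicate("BEWAHREN", "v", "de")),
]

for al in class_shifts(alignments, "v", "n"):
    print(al.source.name, "<->", al.target.name)
```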

The active vs. passive translation in sentences
(3) and (4) is another phenomenon which is ac-
commodated by a predicate-argument structure
(Figure 2): the subject NP502 in the English
sentence corresponds to the passivised subject
NP502 (embedded in PP503) in the German sen-
tence on the basis of having the same argument
role (SAFEGUARDER vs. BEWAHRER) in a com-
parable predication.

It is sometimes assumed that predicate- 
argument structure can be derived or recov- 
ered from constituent structure or functional 
tags such as subject and object. 10 It is true 
that these annotation layers provide important 
heuristic clues for the identification of predi-
cates and arguments and may eventually speed
up the annotation process in a semi-automatic
way. But, as the examples above have shown,
predicate-argument structure goes beyond the
assignment of phrasal categories and grammati-
cal functions, because the grammatical category
of predicate expressions and consequently the
grammatical functions of their arguments can
vary considerably. Also, the predicate-argument
structure licenses the alignment relation by
showing explicitly what it is based on.

7 For all non-verbal predicate expressions for which a
derivationally related verbal expression exists it is as-
sumed that they are deverbal derivations, etymological
counter-evidence notwithstanding.

8 Keeping the argument names consistent for all pred-
icates within a group while differentiating the predicates
on the basis of syntactic form are complementary prin-
ciples, both of which are supposed to facilitate querying
the corpus. The consistency of argument names within
a group, for example, enables the researcher to anal-
yse paradigmatically all realisations of an argument ir-
respective of the syntactic form of the predicate. At the
same time, the differentiation of predicates makes possi-
ble a syntagmatic analysis of the differences of argument
structures depending on the syntactic form of the pred-
icate.

9 All figures are at the end of the paper.
10 See e.g. (Marcus et al., 1994).

3.2 Binding Layer 

As mentioned above, we assume that the an- 
notation model described here is used on top 
of syntactically annotated data. Consequently, 
all elements of the predicate-argument structure
must be bound to elements of the phrasal struc- 
ture (terminal or non-terminal nodes). These 
bindings are stored in a dedicated binding layer 
between the constituent layer and the predicate- 
argument layer. 

A problem arises when there is no direct cor- 
respondence between argument roles and con- 
stituents. For instance, this is the case whenever 
a noun is postmodified by a participle clause: in 
Figure 3, the argument role ENT_RAISED of the
predicate RAISE is realised by NP525, but the
participle clause (IPA517) containing the pred-
icate (raised) needs to be excluded, because
not excluding it would lead to recursion. Con-
sequently, there is no simple way to link the
argument role to its realisation in the tree. 

In these cases, the argument role is linked to 
the appropriate phrase (here: NP525) and the 
constituent that contains the predicate (IPA517)
is pruned out, which results in a discontinu- 
ous argument realisation. Thus, in general, the 
binding layer allows for complex bindings, with 
more than one node of the constituent structure 
to be included in and sub-nodes to be explicitly 
excluded from a binding to a predicate or argu- 
ment. 11 
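
The effect of such a complex binding on the material an argument covers can be sketched in a few lines of Python. The sketch is hypothetical throughout: the Node type, the function covered_tokens, the part-of-speech labels and the English wording are invented for illustration and do not reproduce the sentence in Figure 3 or the database mechanism documented in (Feddes, 2004).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    token: Optional[str] = None  # set for terminal nodes only


def covered_tokens(included, excluded):
    """Tokens dominated by the included nodes, skipping every subtree
    rooted in one of the excluded (pruned) nodes."""
    excluded_ids = {id(node) for node in excluded}

    def walk(node):
        if id(node) in excluded_ids:
            return []
        if node.token is not None:
            return [node.token]
        tokens = []
        for child in node.children:
            tokens.extend(walk(child))
        return tokens

    result = []
    for node in included:
        result.extend(walk(node))
    return result


# Invented wording: a noun postmodified by a participle clause.
ipa517 = Node("IPA517", [
    Node("VVN", token="raised"),
    Node("PP", [Node("IN", token="by"), Node("NNS", token="members")]),
])
np525 = Node("NP525", [
    Node("DT", token="the"),
    Node("NNS", token="issues"),
    ipa517,
])

# Binding for ENT_RAISED: include NP525, prune the participle clause.
print(covered_tokens([np525], excluded=[ipa517]))
```

Representing a binding as a pair of node lists, one to include and one to exclude, keeps the argument realisation expressible even when it is discontinuous in the tree.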

When an expected argument is absent on the
phrasal level due to specific syntactic construc-
tions, the binding of the predicate is tagged ac-
cordingly, thus accounting for the missing argu-
ment. For example, in passive constructions as
in Table 1, the predicate binding is tagged as pv.
Other common examples are imperative con-
structions. Although information of this kind
may possibly be derived from the constituent
structure, it is explicitly recorded in the binding
layer as it has a direct impact on the predicate-
argument structure and thus might prove use-
ful for the automatic extraction of valency pat-
terns.

11 See the database documentation (Feddes, 2004) for
a more detailed description of this mechanism.

Sentence   wenn  korrekt    gedolmetscht  wurde
Gloss      if    correctly  interpreted   was
                                 |
Binding                          pv
                                 |
Pred/Arg                    DOLMETSCHEN

Table 1: Example of a tagged predicate binding
(Europarl:de-en/ep-00-01-18.al, 2532)

Note that the passive tag can also be ex-
ploited in order to query for sentence pairs like
(3) and (4) (in Section 2.2), where an active sen-
tence is translated with a passive: it is straight-
forward to find those instances of aligned predi-
cates where only one binding carries the passive
tag.
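
Such a query amounts to an exclusive-or test on the pv tag of the two bindings. The Python sketch below illustrates the idea under stated assumptions: BoundPredicate and voice_mismatches are hypothetical names, the data live in a plain list rather than in the treebank database, and the predicate names SAFEGUARD, BEWAHREN and INTERPRET are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundPredicate:
    name: str
    binding_tags: frozenset = frozenset()  # e.g. frozenset({"pv"})


def voice_mismatches(aligned_pairs):
    """Aligned predicate pairs in which exactly one binding is tagged pv."""
    return [(a, b) for a, b in aligned_pairs
            if ("pv" in a.binding_tags) != ("pv" in b.binding_tags)]


aligned_pairs = [
    # Active English verb, passive German verb, as in (3) and (4).
    (BoundPredicate("SAFEGUARD"),
     BoundPredicate("BEWAHREN", frozenset({"pv"}))),
    # Hypothetical pair in which both sides are passive.
    (BoundPredicate("INTERPRET", frozenset({"pv"})),
     BoundPredicate("DOLMETSCHEN", frozenset({"pv"}))),
]

for a, b in voice_mismatches(aligned_pairs):
    print(a.name, "<->", b.name)
```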

3.3 Alignment Layer 

On the alignment layer, the elements of a pair of 
predicate-argument structures are aligned with 
each other. Arguments are aligned on the basis 
of corresponding roles within the predications. 
Comparable to the tags used in the binding
layer that account for specific constructions (see
Section 3.2), the alignments may also be tagged
with further information. These tags are used
to classify types of non-literalness like those dis-
cussed in Sections 2.3 and 2.4. 12

Sentences (5) and (6) are an example of a
tagged alignment. As Section 2.3 has shown,
negation may be incorporated in a predicate in
L1, but not in L2. Since our predicate-argument
structure does not include syntactic negation,
this results in the alignment of a predicate in
L1 with its logical opposite in L2. To account
for this fact, predicate alignments of this kind
are tagged as absolute opposites (abs-opp).

Similarly, alignment tagging is applied when
predications are in some way incompatible, as
is the case with sentences (7) and (8) in Sec-
tion 2.4. As can be seen in the aligned annota-
tion (Figure 4), the different information struc-
ture of these sentences has caused the two cor-
responding argument roles of GIVER and
MITGEBER to be realised by two incompatible ex-
pressions representing different referents (NP500
vs. wir). In this case, the alignment between
the incompatible arguments is tagged incomp.

12 The deviant translations described in Sections 2.1
and 2.2 are already represented via predicate class (see
Section 3.1) and on the binding layer (see Section 3.2),
respectively.

If there is no corresponding predicate- 
argument structure in the other language (as 
e. g. the adjectival predicate in sentence (2)) or
if an argument within a structure does not have 
a counterpart in the other language, there will 
be no alignment. 
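
The alignment layer described in this section can be summarised as a list of links, each optionally carrying a tag. The Python sketch below is a hedged illustration only: the Alignment type, the function tag_counts and the predicate names INAPPLICABLE and ANWENDBAR are assumptions made for the example, and elements without a counterpart are simply absent from the list.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alignment:
    kind: str                  # "predicate" or "argument"
    source: str                # element in L1
    target: str                # element in L2
    tag: Optional[str] = None  # None marks the default, untagged case


def tag_counts(alignments):
    """Count how often each alignment tag occurs; untagged links are
    reported under the key 'default'."""
    return dict(Counter(al.tag or "default" for al in alignments))


alignments = [
    Alignment("predicate", "HARMONISE", "HARMONISIERUNG"),
    # Negation incorporated in L1 only, as in (5) and (6).
    Alignment("predicate", "INAPPLICABLE", "ANWENDBAR", tag="abs-opp"),
    # Corresponding roles with different referents, as in (7) and (8).
    Alignment("argument", "GIVER", "MITGEBER", tag="incomp"),
]

deviant = [al for al in alignments if al.tag is not None]
print(tag_counts(alignments))
print([(al.source, al.target, al.tag) for al in deviant])
```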

Table 2 gives an overview of the annotation
layers as described in this section.

Layer       Function

Phrasal     constituent structure of language A
Binding     binding ↓ predicates/arguments to ↑ nodes
PA          predicate-argument structures
Alignment   aligning ↕ predicates and arguments
PA          predicate-argument structures
Binding     binding ↑ predicates/arguments to ↓ nodes
Phrasal     constituent structure of language B

Table 2: The layers of the predicate-argument
annotation

All elements of the alignment structure are 
supposed to mark explicitly the way they con- 
tribute to or distort the resulting translational 
equivalence of a sentence pair. 13 First and fore- 
most, if two elements are aligned to each other, 
this alignment is licensed by their having com- 
parable roles in the predicate-argument struc- 
tures. This is the default case. If, however, a 
particular alignment relation, either of predi- 
cates or of arguments, is deviant in some way, 
this deviance is explicitly marked and classified 
on the alignment layer. 

4 Application and Outlook 

The alignment model we have described is 
currently being used in a project to build 
a treebank of aligned parallel texts in En- 
glish and German with the following lin- 
guistic levels: POS tags, constituent structure 
and functional relations, plus the predicate- 
argument structure and the alignment layer 
to "fuse" the two - hence our working ti-
tle for the treebank, FuSe, which addition-
ally stands for functional semantic annotation
(Cyrus et al., 2003; Cyrus et al., 2004).

Our data source, the Europarl corpus
(Koehn, 2002), contains sentence-aligned pro-
ceedings of the European Parliament in eleven
languages and thus offers ample opportunity
for extending the treebank at a later stage. 14
For syntactic and functional annotation we
basically adapt the TIGER annotation scheme
(Albert et al., 2003), making adjustments
where we deem appropriate and changes which
become necessary when adapting to English an
annotation scheme which was originally devel-
oped for German.

13 Cf. the "translation network" described in
(Santos, 2000) for a much more complex approach
to describing translation in a formal way; this model,
however, goes well beyond what we think is feasible
when annotating large amounts of data.

We use Annotate for the semi-automatic 
assignment of POS tags, hierarchical struc-
ture, phrasal and functional tags (Brants, 1999;
Plaehn, 1998a). Annotate stores all annota-
tions in a relational database. 15 To stay consis- 
tent with this approach we have developed an 
extension to the Annotate database structure 
to model the predicate-argument layer and the 
binding layer. 

Due to the monolingual nature of the Anno- 
tate database structure, the alignment layer 
(Section 3.3) cannot be incorporated into it.
Hence, additional types of databases are needed. 
For each language pair (currently English and 
German), an alignment database is defined 
which represents the alignment layer, thus fus- 
ing two extended Annotate databases. Addi- 
tionally, an administrative database is needed 
to define sets of two Annotate databases and 
one alignment database. The final parallel tree- 
bank will be represented by the union of these 
sets (Feddes, 2004).



While annotators use Annotate to enter 
phrasal and functional structures comfortably, 
the predicate-argument structures and align- 
ments are currently entered into a structured 
text file which is then imported into the 
database. A graphical annotation tool for these 
layers is under development. It will make bind- 
ing the predicate-argument structure to the con- 
stituent structure easier for the annotators and 
suggest argument roles based on previous deci- 
sions. 

Possibilities of semi-automatic methods to
speed up the annotation and thus reduce the
costs of building the treebank are currently be-
ing investigated. 16 Still, quite a bit of manual
work will remain. We believe, however, that the
effort that goes into such a gold-standard paral-
lel treebank is very much worthwhile since the
treebank will eventually prove useful for a num-
ber of fields and can be exploited for numer-
ous applications. To name but a few, translation
studies and contrastive analyses will profit par-
ticularly from the explicit annotation of transla-
tional differences. NLP applications such as Ma-
chine Translation could, e.g., exploit the con-
stituent structures of two languages which are
mapped via the predicate-argument structure.
Also, from the disambiguated predicates and
their argument structures, a multilingual va-
lency dictionary could be derived.

14 There are a few drawbacks to Europarl, such as its
limited register and the fact that it is not easily dis-
cernible which language is the source language. How-
ever, we believe that at this stage the easy accessibility,
the amount of preprocessing and particularly the lack of
copyright restrictions make up for these disadvantages.

15 For details about the Annotate database structure
see (Plaehn, 1998b).

16 One track we follow is to investigate if it is feasible to
have the annotators mark predicate-argument structures
Figure 2: Active vs. passive voice in translations: an example of a tagged binding (pv)
Figure 3: Complex binding of an argument: an example of a pruned constituent (dash-dotted line)
Figure 4: Different information structure: an example of a tagged alignment (incomp)



